High Performance Cyberinfrastructure Enables Data-Driven Science in the Globally Networked World
1. High Performance Cyberinfrastructure Enables Data-Driven Science in the Globally Networked World
Keynote Presentation
Sequencing Data Storage and Management Meeting at
The X-GEN Congress and Expo
San Diego, CA
March 14, 2011
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
Follow me on Twitter: lsmarr
2. Abstract
High performance cyberinfrastructure (10Gbps dedicated optical channels end-to-end) enables new levels of discovery for data-intensive research projects, such as next generation sequencing. In addition to international and national optical fiber infrastructure, we need local campus high performance research cyberinfrastructure (HPCI) to provide "on-ramps," as well as scalable visualization walls and compute and storage clouds, to augment the emerging remote commercial clouds. I will review how UCSD has built out just such an HPCI and is in the process of connecting it to a variety of high throughput biomedical devices. I will show how high performance collaboration technologies allow distributed interdisciplinary teams to analyze these large data sets in real time.
3. Two Calit2 Buildings Provide
Laboratories for "Living in the Future"
• "Convergence" Laboratory Facilities
– Nanotech, BioMEMS, Chips, Radio, Photonics
– Virtual Reality, Digital Cinema, HDTV, Gaming
• Over 1000 Researchers in Two Buildings
– Linked via Dedicated Optical Networks
UC San Diego
UC Irvine
www.calit2.net
Over 400 Federal Grants, 200 Companies
4. The Required Components of
High Performance Cyberinfrastructure
• High Performance Optical Networks
• Scalable Visualization and Analysis
• Multi-Site Collaborative Systems
• End-to-End Wide Area CI
• Data-Intensive Campus Research CI
5. The OptIPuter Project: Creating High Resolution Portals
Over Dedicated Optical Channels to Global Science Data
OptIPortal running the Scalable Adaptive Graphics Environment (SAGE)
Picture Source: Mark Ellisman, David Lee, Jason Leigh
Calit2 (UCSD, UCI), SDSC, and UIC Leads—Larry Smarr PI
Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST
Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent
6. Visual Analytics--Use of Tiled Display Wall OptIPortal
to Interactively View Microbial Genome (5 Million Bases)
Acidobacteria bacterium Ellin345 (Soil Bacterium): 5.6 Mb, ~5000 Genes
Source: Raj Singh, UCSD
7. Use of Tiled Display Wall OptIPortal
to Interactively View Microbial Genome
Source: Raj Singh, UCSD
8. Use of Tiled Display Wall OptIPortal
to Interactively View Microbial Genome
Source: Raj Singh, UCSD
9. Large Data Challenge: Average Throughput to End User
on Shared Internet is 10-100 Mbps
Tested January 2011
Transferring 1 TB:
--50 Mbps = 2 Days
--10 Gbps = 15 Minutes
http://ensight.eos.nasa.gov/Missions/terra/index.shtml
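These quoted times follow from simple arithmetic; a minimal sketch (assumes 1 TB = 10^12 bytes and a fully utilized link, ignoring protocol overhead):

```python
# Sanity check of the quoted transfer times. Assumes 1 TB = 10**12
# bytes and a fully utilized link (no protocol overhead).

def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    return size_bytes * 8 / rate_bps

ONE_TB = 10**12

for label, rate_bps in [("50 Mbps shared Internet", 50e6),
                        ("10 Gbps dedicated lightpath", 10e9)]:
    t = transfer_seconds(ONE_TB, rate_bps)
    print(f"{label}: {t / 3600:.1f} hours")
# 50 Mbps -> ~44.4 hours, i.e. roughly 2 days
# 10 Gbps -> ~0.2 hours, i.e. roughly 13-15 minutes (with overhead)
```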
10. Solution: Give Dedicated Optical Channels
to Data-Intensive Users
Wavelength Division Multiplexing (WDM): 10 Gbps per User ~ 100-1000x Shared Internet Throughput
"Lambdas": individual wavelengths on the fiber, related to optical frequency by c = λf
Source: Steve Wallach, Chiaro Networks
Parallel Lambdas are Driving Optical Networking
The Way Parallel Processors Drove 1990s Computing
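To make "lambdas" concrete: each one is a single wavelength carried on the fiber, and c = λf converts between the frequency grid WDM gear is tuned to and the wavelength itself. A minimal sketch, using the standard ITU-T grid reference at 193.1 THz for illustration:

```python
# Each WDM "lambda" is one wavelength on the fiber; wavelength and
# optical frequency are related by c = lambda * f.
C = 299_792_458  # speed of light, m/s

def wavelength_nm(freq_thz: float) -> float:
    return C / (freq_thz * 1e12) * 1e9

# Three 100 GHz-spaced channels around the ITU-T grid reference:
for f_thz in (193.0, 193.1, 193.2):
    print(f"{f_thz} THz -> {wavelength_nm(f_thz):.2f} nm")
# Each such wavelength can carry its own dedicated 10 Gbps channel.
```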
11. Dedicated 10Gbps Lightpaths Tie Together
State and Regional Fiber Infrastructure
Interconnects Two Dozen State and Regional Optical Networks
Internet2 Dynamic Circuit Network Is Now Available
NLR 40 x 10Gb Wavelengths
12. The Global Lambda Integrated Facility--
Creating a Planetary-Scale High Bandwidth Collaboratory
Research Innovation Labs Linked by 10G Dedicated Lambdas
www.glif.is
Created in Reykjavik,
Iceland 2003
Visualization courtesy of
Bob Patterson, NCSA.
13. Launch of the 100 Megapixel OzIPortal Kicked Off
a Rapid Build Out of Australian OptIPortals
January 15, 2008
No Calit2 Person Physically Flew to Australia to Bring This Up!
Covise: Phil Weber, Jurgen Schulze, Calit2
CGLX: Kai-Uwe Doerr, Calit2
http://www.calit2.net/newsroom/release.php?id=1421
14. "Blueprint for the Digital University"--Report of the UCSD Research Cyberinfrastructure Design Team (April 2009)
• Focus on Data-Intensive Cyberinfrastructure
• No Data Bottlenecks--Design for Gigabit/s Data Flows
research.ucsd.edu/documents/rcidt/RCIDTReportFinal2009.pdf
18. UCSD Planned Optical Networked
Biomedical Researchers and Instruments
• Connects at 10 Gbps:
– Microarrays
– Genome Sequencers
– Mass Spectrometry
– Light and Electron Microscopes
– Whole Body Imagers
– Computing
– Storage
[Campus map of connected sites: CryoElectron Microscopy Facility, San Diego Supercomputer Center, Cellular & Molecular Medicine East, Calit2@UCSD, Bioengineering, Radiology Imaging Lab, National Center for Microscopy & Imaging Research, Center for Molecular Genetics, Pharmaceutical Sciences Building, Biomedical Research, Cellular & Molecular Medicine West]
19. UCSD Campus Investment in Fiber Enables
Consolidation of Energy Efficient Computing & Storage
[Diagram of campus fiber linking: WAN 10Gb (N x 10Gb/s to CENIC, NLR, I2); Gordon (HPD System); Triton (Petascale Data Analysis); DataOasis (Central) Storage; Cluster Condo; GreenLight Data Center; Digital Data Collections; Campus Lab Cluster; Scientific Instruments; OptIPortal Tiled Display Wall]
Source: Philip Papadopoulos, SDSC, UCSD
21. Calit2 Microbial Metagenomics Cluster-
Next Generation Optically Linked Science Data Server
Source: Phil Papadopoulos, SDSC, Calit2
• 512 Processors, ~5 Teraflops
• ~200 Terabytes of Sun X4500 Storage
• 1GbE and 10GbE Connections to a Switched/Routed 10GbE Core
• 4000 Users From 90 Countries
22. OptIPuter Persistent Infrastructure Enables
Calit2 and U Washington CAMERA Collaboratory
Photo Credit: Alan Decker, Feb. 29, 2008
Ginger Armbrust's Diatoms: Micrographs, Chromosomes, Genetic Assembly
iHDTV: 1500 Mbits/sec Calit2 to UW Research Channel Over NLR
23. Creating CAMERA 2.0 -
Advanced Cyberinfrastructure Service Oriented Architecture
Source: CAMERA CTO Mark Ellisman
24. The GreenLight Project:
Instrumenting the Energy Cost of Computational Science
• Focus on 5 Communities with At-Scale Computing Needs:
– Metagenomics
– Ocean Observing
– Microscopy
– Bioinformatics
– Digital Media
• Measure, Monitor, & Web Publish
Real-Time Sensor Outputs
– Via Service-oriented Architectures
– Allow Researchers Anywhere To Study Computing Energy Cost
– Enable Scientists To Explore Tactics For Maximizing Work/Watt
• Develop Middleware that Automates Optimal Choice
of Compute/RAM Power Strategies for Desired Greenness
• Data Center for School of Medicine Illumina Next Gen
Sequencer Storage and Processing
Source: Tom DeFanti, Calit2; GreenLight PI
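As a sketch of the work-per-Watt bookkeeping this enables: integrate the published power readings into energy, then divide work done by energy used. The power samples and operation count below are hypothetical placeholders, not GreenLight data:

```python
# Sketch of work-per-Watt accounting from real-time power sensors.
# The power samples and operation count are hypothetical placeholders.

def energy_joules(power_samples_w, interval_s):
    # Rectangle-rule integration of fixed-interval power readings.
    return sum(power_samples_w) * interval_s

power_w = [310, 325, 330, 318, 305]   # hypothetical 1 Hz PDU readings
e = energy_joules(power_w, interval_s=1.0)

ops = 4.2e12                          # hypothetical operations completed
# Work/Watt expressed as operations per joule (one watt-second).
print(f"energy: {e:.0f} J, work/Watt: {ops / e:.3e} ops per joule")
```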
25. Moving to Shared Enterprise Data Storage & Analysis
Resources: SDSC Triton Resource & Calit2 GreenLight
http://tritonresource.sdsc.edu
Source: Philip Papadopoulos, SDSC, UCSD
• SDSC Large Memory Nodes (x28): 256/512 GB/sys, 8 TB Total, 128 GB/sec, ~9 TF
• SDSC Shared Resource Cluster (x256): 24 GB/Node, 6 TB Total, 256 GB/sec, ~20 TF
• SDSC Data Oasis Large Scale Storage: 2 PB, 50 GB/sec, 3000-6000 disks (Phase 0: 1/3 PB, 8 GB/s)
• UCSD Research Labs and Calit2 GreenLight Connect via the N x 10Gb/s Campus Research Network
26. NSF Funds a Data-Intensive Track 2 Supercomputer:
SDSC’s Gordon, Coming Summer 2011
• Data-Intensive Supercomputer Based on
SSD Flash Memory and Virtual Shared Memory SW
– Emphasizes MEM and IOPS over FLOPS
– Supernode has Virtual Shared Memory:
– 2 TB RAM Aggregate
– 8 TB SSD Aggregate
– Total Machine = 32 Supernodes
– 4 PB Disk Parallel File System >100 GB/s I/O
• System Designed to Accelerate Access to Massive Databases Being Generated in Many Fields of Science, Engineering, Medicine, and Social Science
Source: Mike Norman and Allan Snavely, SDSC
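The machine-wide totals follow directly from the per-supernode figures above; a quick consistency check (assumes exactly the quoted 32 supernodes):

```python
# Aggregate capacity implied by the per-supernode figures above,
# assuming exactly the quoted 32 supernodes.
supernodes = 32
print(f"Total RAM: {supernodes * 2} TB")   # 2 TB/supernode -> 64 TB
print(f"Total SSD: {supernodes * 8} TB")   # 8 TB/supernode -> 256 TB
```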
27. Data Mining Applications will Benefit from Gordon
• De Novo Genome Assembly from Sequencer Reads & Analysis of Galaxies from Cosmological Simulations & Observations
– Will Benefit from Large Shared Memory
• Federations of Databases & Interaction Network Analysis for Drug Discovery, Social Science, Biology, Epidemiology, Etc.
– Will Benefit from Low Latency I/O from Flash
Source: Mike Norman, SDSC
28. Rapid Evolution of 10GbE Port Prices
Makes Campus-Scale 10Gbps CI Affordable
• Port Pricing is Falling
• Density is Rising – Dramatically
• Cost of 10GbE Approaching Cluster HPC Interconnects
[Chart of 10GbE port price vs. year: 2005: Chiaro, $80K/port (60 max); 2007: Force 10, $5K/port (40 max); ~$1000/port (300+ max); 2009: Arista, $500/port (48 ports); 2010: Arista, $400/port (48 ports)]
Source: Philip Papadopoulos, SDSC/Calit2
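The endpoints of that chart imply a steep decline rate; a quick sketch using only the 2005 and 2010 prices above:

```python
# Implied decline in 10GbE port price from the chart's endpoints
# (2005: $80K/port; 2010: $400/port).
p2005, p2010, years = 80_000, 400, 5
print(f"overall drop: {p2005 / p2010:.0f}x")                        # 200x
print(f"annual decline: {1 - (p2010 / p2005) ** (1 / years):.0%}")  # ~65%
```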
30. Calit2 CAMERA Automatic Overflows
into SDSC Triton
[Diagram: the CAMERA-Managed Job Submit Portal (VM) @ Calit2 transparently sends overflow jobs to a submit portal on the Triton Resource @ SDSC; Triton direct-mounts CAMERA's storage over 10Gbps (CAMERA == DATA), so there is no data staging]
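The overflow behavior amounts to a simple policy: run locally when the CAMERA cluster has capacity, otherwise send the job to Triton, which reads the same data over the 10Gbps direct mount. A minimal sketch; all names are hypothetical stand-ins, not the portal's actual API:

```python
# Minimal sketch of the overflow policy described above. All names are
# hypothetical stand-ins, not the portal's actual API.

FREE_SLOTS = {"camera": 0, "triton": 128}  # hypothetical queue state

def run(job: str, cluster: str) -> None:
    # Both clusters see the same directly mounted CAMERA data over
    # 10Gbps, so no staging step precedes execution.
    print(f"running {job} on {cluster}")

def submit(job: str) -> None:
    if FREE_SLOTS["camera"] > 0:
        run(job, "camera")   # normal case: run locally at Calit2
    else:
        run(job, "triton")   # overflow: transparently sent to SDSC

submit("metagenome-annotation-job")
```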
31. California and Washington Universities Are Testing
a 10Gbps Connected Commercial Data Cloud
• Amazon Experiment for Big Data
– Only Available Through CENIC & Pacific NW GigaPOP
– Private 10Gbps Peering Paths
– Includes Amazon EC2 Computing & S3 Storage Services
• Early Experiments Underway
– Robert Grossman, Open Cloud Consortium
– Phil Papadopoulos, Calit2/SDSC Rocks
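For flavor, moving sequencing output into S3 over such a peering path is a one-call upload; a minimal sketch using today's boto3 client rather than 2011-era tooling, with hypothetical bucket and file names:

```python
# Illustrative only: uses today's boto3 client, not the 2011-era
# tooling; bucket and object names are hypothetical, and AWS
# credentials are assumed to be configured in the environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file("run42/reads.fastq",        # local sequencer output
               "example-genomics-bucket",  # hypothetical S3 bucket
               "runs/run42/reads.fastq")   # object key in the bucket
```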
32. Academic Research OptIPlanet Collaboratory:
A 10Gbps ―End-to-End‖ Lightpath Cloud
[Diagram: an End User OptIPortal connects through a Campus Optical Switch and 10G Lightpaths over National LambdaRail to HPC, Local or Remote Instruments, Data Repositories & Clusters, HD/4k Live Video, and HD/4k Video Repositories]
This is a production cluster with its own Force10 E1200 switch. It is connected to Quartzite and is labeled as the "CAMERA Force10 E1200". We built CAMERA this way because of the technology deployed successfully in Quartzite.